Abstract

Zero-sum stochastic games are easy to solve as they can be cast as simple Markov decision processes. This 
is however not the case with general-sum stochastic games. A fairly general optimization problem formulation 
is available for general-sum stochastic games by Filar and Vrieze [2004]. However, the optimization problem
there has a non-linear objective and non-linear constraints with special structure. Since gradients of both the 
objective as well as constraints of this optimization problem are well defined, gradient based schemes seem to be 
a natural choice. We discuss a gradient scheme tuned for two-player stochastic games. We show in simulations 
that this scheme indeed converges to a Nash equilibrium, for a simple terrain exploration problem modelled as a 
general-sum stochastic game. However, it turns out that only global minima of the optimization problem correspond
to Nash equilibria of the underlying general-sum stochastic game, while gradient schemes only guarantee
convergence to local minima. We then provide important necessary conditions for gradient schemes to converge 
to Nash equilibria in general-sum stochastic games. 

Keywords: Game theory, Nonlinear programming, Non-convex constrained problems, Discounted cost criteria,
General-sum stochastic games, Nash equilibrium.


1 Introduction 

Game theory is seen as a useful means to handle multi-agent scenarios. Since the seminal work of Shapley [1953], 
stochastic games have been an important class of models for multi-agent systems. A comprehensive treatment of 
stochastic games under various payoff criteria is given by Filar and Vrieze [2004]. Many interesting problems like
fishery games, advertisement games, etc., can be modelled as stochastic games; see Filar and Vrieze [2004]. One
of the significant results is that every stochastic game has a Nash equilibrium and it can be characterised in terms 
of global minima of a suitable mathematical programming problem. 

As an application of general-sum games to the multi-agent scenario, Singh et al. [2000] observed that in a 
two-agent iterated general-sum game, Nash convergence is assured either in strategies or at the very least in average
payoffs. Later, Hu and Wellman [1999] observed that stochastic game theory is a better framework
for multi-agent scenarios, as it can be viewed as an extension of the well-studied Markov decision theory (see
Bertsekas [1995]). However, in the stochastic game setting, general-sum games are difficult to solve as, unlike
zero-sum games, they cannot be cast in a framework similar to Markov decision processes. Hu and Wellman
[1999] proposed an interesting Q-learning algorithm, that is based on reinforcement learning (see Bertsekas and 
Tsitsiklis [1996]). However, their algorithm assures convergence only if the game has exactly one Nash equilibrium.
An extension to the above algorithm called NashQ was proposed by Hu and Wellman [2003] which showed
improvement in performance. However, convergence in a scenario with multiple Nash equilibria was not addressed.
Another noteworthy work is that of Littman [2001], who proposed a friend-or-foe Q-learning (FFQ) algorithm
as an improvement over the NashQ algorithm with assured convergence, though not necessarily to a Nash equilibrium.
Moreover, the FFQ algorithm is applicable to a restricted class of games where either full co-operation
between agents is ensured or the game is zero-sum. Algorithms for some specific cases of stochastic games such 
as Additive Reward and Additive Transition (AR-AT) games are discussed by Filar and Vrieze [2004] as well as 
Breton et al. [1986]. 

A new type of approach based on homotopy is proposed by Herings and Peeters [2004]. In this approach, a 
homotopic path between equilibrium points of N independent MDPs and the N-player stochastic game in question,
is traced numerically giving a Nash equilibrium point of the stochastic game of interest. Their result applies to 
both normal form as well as extensive form games. However, this approach has a complexity similar to that of 
typical gradient descent schemes discussed in this paper. For more recent developments in this direction, see the 
work by Herings and Peeters [2006] and Borkovsky et al. [2010]. 

A recent approach for the computation of Nash equilibria is given by Akchurina [2009] in which a reinforcement
learning type scheme is proposed. Though their experiments do show convergence in a large group of
randomly generated games, a formal proof of convergence has not been provided. 

For general-sum stochastic games, [Breton et al., 1986, Section 4.3] provides an interesting optimization problem
with non-linear objectives and linear constraints whose global minima correspond to Nash equilibria of the
underlying general-sum stochastic game. However, since the objective is not guaranteed to be convex, simple 
gradient descent techniques might not converge to a global minimum. Mac Dermed and Isbell [2009] formulate 
intermediate optimization problems, called Multi-Objective Linear Programs (MOLPs), to compute Nash equilibria
as well as Pareto optimal solutions. However, as mentioned in that paper, the complexity of their algorithm
scales exponentially with the problem size. Thus, their algorithm is tractable only for small sized problems with a 
few tens of states. 

Another non-linear optimization problem for computing Nash equilibria in general-sum stochastic games has 
been given by Filar and Vrieze [2004]. We begin with this optimization problem by discussing it in Section 2. 
Gradient based techniques are quite common for solving optimization problems. In the optimization problem, 
gradients of (both) the objective and all constraints w.r.t. the value vector $v$ and strategy vector $\pi$ are well defined.
A possible solution approach is to apply simple gradient based techniques to solve these optimization problems. 
We look at a possible gradient descent scheme in Section 3. In the process of construction of this scheme, several 
initial hurdles for gradient descent schemes are discussed and addressed. We then consider an example problem 
of terrain exploration, modelled as a general-sum stochastic game, in Section 4. In the same section, we show 
via simulation that the gradient descent scheme of Section 3 does indeed give a Nash equilibrium solution to the 
terrain exploration problem. In our case, only the global minima of the optimization problem at hand correspond to
Nash equilibria of the underlying general-sum stochastic game, as discussed later in Section 2.4. It is well known
that gradient descent schemes can only guarantee convergence to local minima. But, in the optimization problems 
that we consider, global minima are desired. So, a question remains: Are simple gradient descent schemes good 
enough to give Nash equilibria in the aforementioned optimization problems? In other words, are there only global 
minimum points in these optimization problems, so that simple gradient descent schemes can easily work? We
address this issue in Section 5. Finally, in Section 6 we provide the concluding remarks.


2 The Optimization Problem 

The framework of a general-sum stochastic game is described in Section 2.1. A basic idea of the optimization 
problem is given in Section 2.2. The full optimization problem is then formulated in Section 2.3 for the infinite 
horizon discounted reward setting. Some important results by Filar and Vrieze [2004] that are applicable here are 
then described. 

2.1 Stochastic Games 

A two-agent scenario is considered in the following formulation. One can in general consider an N-agent scenario
for N > 2. We assume N = 2 only for notational simplicity. We interchangeably use the terms ‘agent’ and 
‘player’ to mean the same entity in the description below. We assume that the stochastic game terminates in a 
finite but random time. Hence, a discounted value framework in dynamic programming has been chosen for the 
optimization problem. 

A stochastic game is described via a tuple $\langle S, A, p, r \rangle$. The quantities in the tuple are explained through the
description below. 

(i) S denotes the state space. 
(ii) $A^i(x)$ denotes the action space for the $i$-th agent, $i = 1, 2$. $A(x) = \times_{i=1}^{2} A^i(x)$, the Cartesian product, is the
aggregate action space consisting of all possible actions of both agents when the state of the game is $x \in S$.

(iii) $p(y|x, a)$ denotes the probability of going from the state $x \in S$ at the current instant to $y \in S$ at the immediate
next instant when the action $a \in A(x)$ is chosen.

(iv) Finally, $r(x, a)$ denotes the vector of reward functions of both agents when the state is $x \in S$ and the vector
of actions $a \in A(x)$ is chosen.

For an infinite horizon discounted reward setting, a discount factor $0 < \beta < 1$ is also included in the tuple
describing the game. As is clear from this definition, a stochastic game can be viewed as an extension of the 
single-agent Markov decision process. 
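The tuple above can be held in a small container. The following is a minimal sketch, not part of the original formulation: the class name `StochasticGame`, its field layout, and the two-state game `toy` are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class StochasticGame:
    """Two-player discounted stochastic game <S, A, p, r> with discount beta."""
    states: list    # S
    actions: dict   # actions[x] = (A^1(x), A^2(x)), each a list of actions
    p: dict         # p[(x, (a1, a2))] = {y: probability of moving to y}
    r: dict         # r[(x, (a1, a2))] = (reward to agent 1, reward to agent 2)
    beta: float     # discount factor, 0 < beta < 1

    def joint_actions(self, x):
        """Aggregate action space A(x) = A^1(x) x A^2(x)."""
        a1, a2 = self.actions[x]
        return list(product(a1, a2))

    def validate(self):
        """Check 0 < beta < 1 and that each p(.|x, a) is a distribution on S."""
        assert 0.0 < self.beta < 1.0
        for x in self.states:
            for a in self.joint_actions(x):
                dist = self.p[(x, a)]
                assert all(y in self.states and q >= 0.0 for y, q in dist.items())
                assert abs(sum(dist.values()) - 1.0) < 1e-12
        return True


# Hypothetical two-state game, used only to exercise the container.
toy = StochasticGame(
    states=[0, 1],
    actions={0: ([0, 1], [0, 1]), 1: ([0], [0])},
    p={(0, (0, 0)): {0: 1.0}, (0, (0, 1)): {1: 1.0},
       (0, (1, 0)): {1: 1.0}, (0, (1, 1)): {0: 0.5, 1: 0.5},
       (1, (0, 0)): {1: 1.0}},
    r={(0, (0, 0)): (1.0, 0.0), (0, (0, 1)): (0.0, 2.0),
       (0, (1, 0)): (3.0, 1.0), (0, (1, 1)): (0.0, 0.0),
       (1, (0, 0)): (0.0, 0.0)},
    beta=0.9,
)
```

Storing the transition law as a sparse dictionary per state-action pair is one possible design; it keeps only the reachable next states, which matches the later use of the set of next states in the dynamic programming equations.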

A strategy $\pi^i = (\pi^i_1, \pi^i_2, \dots, \pi^i_t, \dots)$ of the $i$-th player in a stochastic game prescribes the action to be performed
in each state at each time instant $t$ by that player. We denote by $\pi^i_t(\cdot)$ the action prescribed for the $i$-th agent by
the strategy $\pi^i$ at time instant $t$. The argument '$\cdot$' in $\pi^i_t(\cdot)$, in general, corresponds to the entire history of states
and actions of all agents up to the $(t-1)$-st instant and the current system state at the $t$-th instant. Let the set
of all possible strategies for the $i$-th player be denoted by $\Gamma^i$. A strategy $\pi^i$ of player $i$ is said to be a Markov
strategy if $\pi^i_t$ depends only on the current state $x_t \in S$ at time $t$. Thus, for a Markov strategy $\pi^i$ of player $i$,
$\pi^i_t(x) \in A^i(x)$, $\forall t \ge 0$, $x \in S$, $i = 1, 2$. If the action chosen in any state for a Markov strategy $\pi^i$ is independent
of the time instant $t$, viz., $\pi^i_t = \bar{\pi}^i$, $\forall t \ge 0$, $i = 1, 2$, for some $\bar{\pi}^i$ such that $\bar{\pi}^i(x) \in A^i(x)$, $\forall x \in S$, then the strategy
is said to be stationary. Henceforth, we shall restrict our attention to stationary strategies only. By abuse of notation,
we denote by $\pi^i$ itself the stationary strategy. Extending this to all players, we denote a Markov strategy-tuple by

$\pi = (\pi^1, \pi^2)$.

Let $\Delta(A(x))$ (resp. $\Delta(A^i(x))$) denote the set of all probability measures on $A(x)$ (resp. $A^i(x)$). A randomized
Markov strategy is specified via the sequence of maps $\phi^i_t : S \to \Delta(A^i(x))$, $x \in S$, $t \ge 0$, $i = 1, 2$. Thus, $\phi^i_t(x)$ is
a distribution on the set of actions $A^i(x)$ and in general depends on the time instant $t$. We say that $\phi^i$ is a stationary
randomized strategy or simply a randomized strategy for player $i$ if $\phi^i_t = \phi^i$ for all $t$. By an abuse of notation, we denote
by $\pi = (\pi^1, \pi^2)$ a stationary randomized strategy-tuple that we also (many times) call a strategy, since from
now on, we shall only work with randomized strategies. We use $\pi^i(x, a)$ to denote the probability of picking
action $a \in A^i(x)$ in state $x \in S$ by agent $i$. [Filar and Vrieze, 2004, Theorem 3.8.1, pp. 130] states that a Nash
equilibrium in stationary randomized strategies exists for general-sum discounted stochastic games. We will refer 
to such stationary randomized strategies as Nash strategies. Similar to MDPs [Bertsekas, 1995], one can define 
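the value function as follows:

$$v^i_\pi(x_0) = E\left[ \sum_{t} \beta^t \sum_{a \in A(x_t)} \left( r^i(x_t, a) \prod_{j=1}^{2} \pi^j(x_t, a^j) \right) \right], \quad \forall i = 1, 2. \qquad (1)$$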


Let $\pi^{-i}$ represent the strategy of the agent other than the $i$-th agent, that is, $\pi^{-1} = \pi^2$ and $\pi^{-2} = \pi^1$ respectively.
Formally, we define Nash strategies and Nash equilibrium below. 


Definition 1 (Nash Equilibrium) A stationary Markov strategy $\pi^* = (\pi^{1*}, \pi^{2*})$ is said to be Nash if

$$v^i_{\pi^*}(x) \ge v^i_{(\pi^i, \pi^{-i*})}(x), \quad \forall \pi^i, \ i = 1, 2, \ \forall x \in S.$$


The corresponding equilibrium of the game is said to be a Nash equilibrium. 


Like in normal-form games [Nash, 1950], pure strategy Nash equilibria may not exist in the case of stochastic 
games. Using dynamic programming, the Nash equilibrium condition can be written as: 


$$v^i(x) = \max_{\pi^i(x) \in \Delta(A^i(x))} \left\{ E_{\pi(x)} \left[ r^i(x, a) + \beta \sum_{y \in U(x)} p(y|x, a) v^i(y) \right] \right\}, \quad \forall i = 1, 2. \qquad (2)$$


Unlike MDPs, (2) involves two maximization equations. Note that the two equations are coupled because the 
reward of one agent is influenced by the strategy of the other agent and so is the state transition. 


2.2 The Basic Formulation 


The dynamic programming equation (2) for finding optimal values can now be revised to: 


$$v^i(x) = \max_{\pi^i(x) \in \Delta(A^i(x))} \left\{ E_{\pi^i(x)} Q^i(x, a^i) \right\}, \quad \forall x \in S, \ \forall i = 1, 2, \qquad (3)$$

where

$$Q^i(x, a^i) = E_{\pi^{-i}(x)} \left[ r^i(x, a) + \beta \sum_{y \in U(x)} p(y|x, a) v^i(y) \right]$$

represents the marginal value associated with picking action $a^i \in A^i(x)$ in state $x \in S$ for agent $i$. Also,
$\Delta(A^i(x))$ denotes the set of all possible probability distributions over $A^i(x)$. We derive a possible optimization
problem from (3) in Section 2.2.1 followed by a discussion of possible constraints on the feasible solutions in
Section 2.2.2.

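The marginal value $Q^i(x, a^i)$ can be computed directly from its definition. The sketch below is illustrative only: the function name `q_value`, the dictionary layout, the zero-based agent index, and the one-state example are all assumptions, not the paper's implementation.

```python
def q_value(i, x, a_i, pi_other, v_i, r, p, beta):
    """Q^i(x, a^i): expectation over the other agent's action a^{-i} ~ pi^{-i}(x).

    i        : agent index, 0 for the first agent and 1 for the second
    pi_other : pi_other[x] = {action of the other agent: probability}
    v_i      : v_i[y] = value of state y for agent i
    r, p     : r[(x, (a1, a2))] = (reward 1, reward 2); p[(x, (a1, a2))] = {y: prob}
    """
    total = 0.0
    for a_other, prob in pi_other[x].items():
        # Joint actions are always ordered as (a^1, a^2).
        a = (a_i, a_other) if i == 0 else (a_other, a_i)
        continuation = sum(q * v_i[y] for y, q in p[(x, a)].items())
        total += prob * (r[(x, a)][i] + beta * continuation)
    return total


# Hypothetical one-state game: rewards (a1 + a2, a1 - a2), self-loop transitions.
r = {(0, (a1, a2)): (float(a1 + a2), float(a1 - a2)) for a1 in (0, 1) for a2 in (0, 1)}
p = {(0, (a1, a2)): {0: 1.0} for a1 in (0, 1) for a2 in (0, 1)}
other_policy = {0: {0: 0.25, 1: 0.75}}
q1 = q_value(0, 0, 1, other_policy, {0: 10.0}, r, p, beta=0.5)
q2 = q_value(1, 0, 1, other_policy, {0: 10.0}, r, p, beta=0.5)
```

Here `q1` averages agent 1's payoff for its action 1 over agent 2's policy, and `q2` does the same for agent 2 against agent 1's policy; the coupling through `pi_other` is what distinguishes this from the single-agent case.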
2.2.1 The objective


Equation (3) says that $v^i(x)$ represents the maximum value of $E_{\pi^i(x)} Q^i(x, a^i)$ over all possible convex combinations
of policy of agent $i$, $\pi^i(x) \in \Delta(A^i(x))$. However, neither the optimal value $v^i(x)$ nor the optimal policy $\pi^i$ is known
a priori. So, a possible optimization objective would be

$$f^i(v^i, \pi^i) = \sum_{x \in S} \left( v^i(x) - E_{\pi^i(x)} Q^i(x, a^i) \right),$$

which will have to be minimized over all possible policies $\pi^i(x) \in \Delta(A^i(x))$. But $Q^i(x, a^i)$, by definition, is
dependent on strategies of all other agents. So, an isolated minimization of $f^i(v^i, \pi^i)$ would really not make
sense. Rather we need to consider the aggregate objective,

$$f(v, \pi) = \sum_{i=1}^{2} f^i(v^i, \pi^i),$$

which is minimized over all possible policies $\pi^i(x) \in \Delta(A^i(x))$, $i = 1, 2$. Thus, we have an optimization problem
with objective $f(v, \pi)$ along with natural constraints ensuring that the policy vectors $\pi^i(x)$ remain
probabilities over all possible actions $A^i(x)$ for all states $x \in S$ for both agents. Formally, we write this optimization
problem as below:

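$$\min_{v, \pi} f(v, \pi) = \sum_{i=1}^{2} \sum_{x \in S} \left( v^i(x) - E_{\pi^i(x)} Q^i(x, a^i) \right) \quad \text{s.t.}$$

(a) $\pi^i(x, a^i) \ge 0$, $\forall a^i \in A^i(x)$, $x \in S$, $i = 1, 2$,

(b) $\sum_{a^i \in A^i(x)} \pi^i(x, a^i) = 1$, $\forall x \in S$, $i = 1, 2$. (4)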

Intuitively, all those $(v, \pi)$ pairs which make $f(v, \pi)$ zero with $\pi$ satisfying (4(a))-(4(b)) should correspond
to Nash equilibria of the corresponding general-sum discounted stochastic game. The question is: Is this true?
We address this question in two parts. First, if $\pi^*$ represents a Nash strategy-tuple with $v^*$ as the corresponding
dynamic programming value obtained from (2), is $f(v^*, \pi^*)$ zero? Second,
if $(v^*, \pi^*)$ is such that (4(a))-(4(b)) are satisfied and $f(v^*, \pi^*)$ is zero, is $\pi^*$ a Nash strategy-tuple?
We address these two questions in Lemmas 2.1 and 2.2.

Lemma 2.1 Let $(v^*, \pi^*)$ represent a possible solution for the dynamic programming equation (2). Then, $(v^*, \pi^*)$
is a feasible solution of the optimization problem (4) and $f(v^*, \pi^*) = 0$.

Proof: Proof follows simply from the construction of the optimization problem (4). ■ 

Lemma 2.2 Let $(v^*, \pi^*)$ be a feasible solution of the optimization problem (4) such that $f(v^*, \pi^*) = 0$. Then, $\pi^*$
need not be a Nash strategy-tuple and $v^*$ need not correspond to the dynamic programming value obtained from (2).

Proof: We provide a proof by example. Choose a $\pi^*$ such that it is not a Nash strategy-tuple. Then, to make
$f(v^*, \pi^*) = 0$, we need to compute a $v^*$ such that

$$v^{i*}(x) - E_{\pi^{i*}(x)} Q^i(x, a^i) = 0, \quad \forall x \in S, \ i = 1, 2. \qquad (5)$$

Let $R^i = \langle E_{\pi^*(x)} r^i(x, a) : x \in S \rangle$ be a column vector over rewards to agent $i$ in various states of the underlying
game. Also, let $P = [E_{\pi^*(x)} p(y|x, a) : x \in S, y \in S]$ represent the state-transition matrix of the underlying
Markov process. Then, (5) can be written in vector form as

$$v^{i*} - (R^i + \beta P v^{i*}) = 0, \quad \forall i = 1, 2. \qquad (6)$$

Since $P$ is a stochastic matrix, all its eigenvalues have modulus less than or equal to one. As $0 < \beta < 1$, the eigenvalues of $\beta P$ have
modulus strictly less than one, and thus the matrix $I - \beta P$ is
invertible. So the system of equations (6) has a unique solution with

$$v^{i*} = (I - \beta P)^{-1} R^i, \quad i = 1, 2.$$

Thus for any strategy-tuple $\pi^*$, which need not be Nash, there exists a corresponding $v^*$ such that $f(v^*, \pi^*) = 0$.
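The argument in the proof can be checked numerically. The following is a minimal sketch under stated assumptions: the vector `R` and matrix `P` are hypothetical, assumed to be already averaged over some arbitrary (not necessarily Nash) strategy-tuple, and the linear system (6) is solved by fixed-point iteration rather than by matrix inversion, which is valid because the map is a contraction for a discount factor below one.

```python
def evaluate_policy(R, P, beta, tol=1e-12, max_iter=100000):
    """Solve v = R + beta * P v by fixed-point iteration (a contraction for beta < 1)."""
    n = len(R)
    v = [0.0] * n
    for _ in range(max_iter):
        new_v = [R[x] + beta * sum(P[x][y] * v[y] for y in range(n)) for x in range(n)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


def residual(v, R, P, beta):
    """Per-state terms v(x) - (R(x) + beta * (P v)(x)) from the vector form (6)."""
    n = len(R)
    return [v[x] - (R[x] + beta * sum(P[x][y] * v[y] for y in range(n)))
            for x in range(n)]


# Hypothetical policy-averaged rewards and transitions for one agent.
R = [1.0, 0.0]
P = [[0.5, 0.5], [0.0, 1.0]]   # P[x][y]: probability of moving from x to y
beta = 0.9
v = evaluate_policy(R, P, beta)
gap = sum(residual(v, R, P, beta))   # this agent's contribution to f(v, pi)
```

The contribution `gap` to the objective is zero up to the iteration tolerance regardless of how the strategy-tuple behind `R` and `P` was chosen, which is exactly why the objective alone cannot certify a Nash equilibrium.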


2.2.2 Constraints 

The basic optimization problem (4) has only a set of simple constraints ensuring that $\pi$ remains a valid strategy.
As shown in Lemma 2.2, this optimization problem is not sufficient to accurately represent Nash equilibria of
the underlying general-sum discounted stochastic game. Here, we look at a possible set of additional constraints
which might make the optimization problem more useful. Note that the term being maximized in equation (3), i.e.,
$E_{\pi^i(x)} Q^i(x, a^i)$, represents a convex combination of the values of $Q^i(x, a^i)$ over all possible actions $a^i \in A^i(x)$ in
a given state $x \in S$ for a given agent $i$. Thus, it is implied that

$$Q^i(x, a^i) \le v^i(x), \quad \forall a^i \in A^i(x), \ x \in S, \ i = 1, 2, \dots, N.$$

So, we could consider a new optimization problem with these additional constraints. However, the previously
posed question remains: Is this good enough to make a feasible $(v, \pi)$ with $f(v, \pi) = 0$ correspond to a Nash
equilibrium? We show that this is indeed true in the next section.


2.3 Optimization Problem for two-player Stochastic Games 

An optimization problem on similar lines as in Section 2.2, for a two-player general-sum discounted stochastic 
game has been given by Filar and Vrieze [2004]. The optimization problem is as follows: 


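$$\min_{v, \pi} f(v, \pi) = \sum_{i=1}^{2} 1^T_{|S|} \left[ v^i - r^i(\pi) - \beta P(\pi) v^i \right] \quad \text{s.t.}$$

(a) $\pi^2(x)^T \left[ r^1(x) + \beta \sum_{y \in U(x)} P(y|x) v^1(y) \right] \le v^1(x) 1^T_{m^1(x)} \quad \forall x \in S$

(b) $\left[ r^2(x) + \beta \sum_{y \in U(x)} P(y|x) v^2(y) \right] \pi^1(x) \le v^2(x) 1_{m^2(x)} \quad \forall x \in S$

(c) $\pi^1(x)^T 1_{m^1(x)} = 1 \quad \forall x \in S$

(d) $\pi^2(x)^T 1_{m^2(x)} = 1 \quad \forall x \in S$

(e) $\pi^1(x, a^1) \ge 0 \quad \forall a^1 \in A^1(x), \ \forall x \in S$

(f) $\pi^2(x, a^2) \ge 0 \quad \forall a^2 \in A^2(x), \ \forall x \in S$. (7)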


where, 

(i) $v = (v^i : i = 1, 2)$ is the vector of value vectors of all agents with $v^i = (v^i(y) : y \in S)$ being the value vector
for the $i$-th agent (over all states). Here, $v^i(x)$ is the value of the state $x \in S$ for the $i$-th agent.

(ii) $\pi = (\pi^i : i = 1, 2)$ and $\pi^i = (\pi^i(x) : x \in S)$, where $\pi^i(x) = (\pi^i(x, a) : a \in A^i(x))$ is the randomized
policy vector in state $x \in S$ for the $i$-th agent. Here $\pi^i(x, a)$ is the probability of picking action $a$ by the $i$-th agent in
state $x$.

(iii) $r^i(x) = [r^i(x, a^1, a^2) : a^1 \in A^1(x), a^2 \in A^2(x)]$ is the reward matrix for the $i$-th agent when in state $x \in S$
with rows corresponding to the actions of the second agent and columns corresponding to that of the first. Here,
$r^i(x, a^1, a^2)$ is the reward obtained by the $i$-th agent in state $x \in S$ when the first agent has taken action $a^1 \in A^1(x)$
and the second agent $a^2 \in A^2(x)$.

(iv) $r^i(x, a^1, A^2(x))$ is the column in $r^i(x)$ corresponding to the action $a^1 \in A^1(x)$ of the first player. Each entry
in the column corresponds to one action of the second player which is why we use $A^2(x)$ as an argument above.
Likewise, $r^i(x, A^1(x), a^2)$ is the row in $r^i(x)$ corresponding to the action $a^2 \in A^2(x)$ of the second
player.

(v) $r^i(\pi) = r^i((\pi^1, \pi^2)) = (\pi^2(x)^T r^i(x) \pi^1(x) : x \in S)$, where $\pi^2(x)^T r^i(x) \pi^1(x)$ represents the expected
reward for the given state $x$ when actions are selected by both agents according to policies $\pi^1$ and $\pi^2$ respectively.

(vi) $P(y|x) = [p(y|x, a) : a = (a^1, a^2), a^1 \in A^1(x), a^2 \in A^2(x)]$ is a matrix representing the probabilities of
transition from the current state $x \in S$ to a possible next state $y \in S$ at the next instant with rows representing the
actions of the second player and columns representing those of the first player.

(vii) $P(y|x, a^1, A^2(x))$ is the column in $P(y|x)$ corresponding to the case when the first player picks action
$a^1 \in A^1(x)$. As with $r^i(x, a^1, A^2(x))$, each entry in the above column corresponds to an action of the second
player. Similarly, $P(y|x, A^1(x), a^2)$ is the row in $P(y|x)$ corresponding to the case when the second player picks
an action $a^2 \in A^2(x)$.

(viii) $P(\pi) = P((\pi^1, \pi^2)) = [\pi^2(x)^T P(y|x) \pi^1(x) : x \in S, y \in S]$ is a matrix with rows representing the
possible current states $x$ and columns representing the future possible states $y$, so that $P(\pi) v^i$ is the vector of expected
next-state values. Here, $\pi^2(x)^T P(y|x) \pi^1(x)$ represents the transition probability from $x$ to $y$ under policy $\pi$.

(ix) $m^i(x) = |A^i(x)|$, and (recall that)

(x) $U(x) \subseteq S$ represents the set of next states for a given state $x \in S$.

The inequality constraints in the optimization problem are quadratic in $v$ and $\pi$. The first set of inequality
constraints (7(a)) on the first agent are quadratic in $v^1$ and $\pi^2$ and the second set (7(b)) on the second agent are
quadratic in $v^2$ and $\pi^1$ respectively. However, in both cases, the quadratic terms are only cross products between
the components of a value vector and a strategy vector.

The objective function is a non-negative cubic function of $v$ and $\pi$. All the non-linear terms in the objective function
are cross terms. The cross terms between the value vector of an agent and the strategy vectors of the two
agents are present in the term $P(\pi) v^i$ of the objective function and those between strategy vectors of the two agents
are in the term $r^i(\pi)$.
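To make the structure of (7) concrete, the objective and the inequality constraints (7(a)) and (7(b)) can be evaluated for a given pair of value and strategy vectors. The sketch below is an illustration under assumptions: the function name `objective_and_slacks`, the nested-list layout (rows indexed by the second agent's action, columns by the first agent's), and the single-state prisoner's-dilemma example with discount 0.5 are hypothetical and are not taken from the paper.

```python
def objective_and_slacks(v, pi1, pi2, r, P, beta):
    """Return f(v, pi) from (7) and the slacks of constraints (7(a)) and (7(b)).

    v[i][x]          : value of state x for agent i (i = 0, 1)
    pi1[x], pi2[x]   : action probabilities of the first / second agent in state x
    r[i][x][a2][a1]  : reward to agent i; P[x][y][a2][a1]: transition probability
    A slack <= 0 means the corresponding inequality is satisfied.
    """
    n_states = len(pi1)
    f = 0.0
    slacks_a, slacks_b = [], []
    for x in range(n_states):
        m1, m2 = len(pi1[x]), len(pi2[x])
        # M[i][a2][a1] = r^i(x, a1, a2) + beta * sum_y p(y | x, a1, a2) v^i(y)
        M = [[[r[i][x][a2][a1]
               + beta * sum(P[x][y][a2][a1] * v[i][y] for y in range(n_states))
               for a1 in range(m1)] for a2 in range(m2)] for i in range(2)]
        # Left side of (7(a)): pi^2(x)^T M^1, one entry per action of agent 1.
        row = [sum(pi2[x][a2] * M[0][a2][a1] for a2 in range(m2)) for a1 in range(m1)]
        # Left side of (7(b)): M^2 pi^1(x), one entry per action of agent 2.
        col = [sum(M[1][a2][a1] * pi1[x][a1] for a1 in range(m1)) for a2 in range(m2)]
        slacks_a.append([q - v[0][x] for q in row])
        slacks_b.append([q - v[1][x] for q in col])
        # Objective terms: v^i(x) - pi^2(x)^T M^i pi^1(x) for both agents.
        f += v[0][x] - sum(row[a1] * pi1[x][a1] for a1 in range(m1))
        f += v[1][x] - sum(col[a2] * pi2[x][a2] for a2 in range(m2))
    return f, slacks_a, slacks_b


# Hypothetical single-state prisoner's dilemma (action 0 = cooperate, 1 = defect).
r = [[[[3.0, 5.0], [0.0, 1.0]]],    # agent 1: r[0][x][a2][a1]
     [[[3.0, 0.0], [5.0, 1.0]]]]    # agent 2: r[1][x][a2][a1]
P = [[[[1.0, 1.0], [1.0, 1.0]]]]    # the only state is absorbing
f_nash, a_nash, b_nash = objective_and_slacks(
    [[2.0], [2.0]], [[0.0, 1.0]], [[0.0, 1.0]], r, P, 0.5)   # both defect
f_coop, a_coop, b_coop = objective_and_slacks(
    [[6.0], [6.0]], [[1.0, 0.0]], [[1.0, 0.0]], r, P, 0.5)   # both cooperate
```

In this example both points give a zero objective, but only the mutual-defection point keeps all slacks non-positive; the mutual-cooperation point violates (7(a)) and (7(b)), which illustrates why the inequality constraints are needed on top of the objective.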

We modify the constraints in the optimization problem (7) by eliminating all the equality constraints in it, as
follows: One of the elements $\pi^i(x, a^i)$, $a^i \in A^i(x)$, in each of the equations $\sum_{a^i \in A^i(x)} \pi^i(x, a^i) = 1$ can be automatically
set, thereby resulting in inequality constraints over the remaining components as below. Let $\hat{a}^i(x)$, $i = 1, 2$,
denote the actions eliminated using the equality constraint. Then, the set of constraints can be re-written as in (8).

$$\pi^2(x)^T \left[ r^1(x) + \beta \sum_{y \in U(x)} P(y|x) v^1(y) \right] \le v^1(x) 1^T_{m^1(x)} \quad \forall x \in S$$

$$\left[ r^2(x) + \beta \sum_{y \in U(x)} P(y|x) v^2(y) \right] \pi^1(x) \le v^2(x) 1_{m^2(x)} \quad \forall x \in S$$

$$\sum_{a^i \in A^i(x) \setminus \{\hat{a}^i(x)\}} \pi^i(x, a^i) \le 1 \quad \forall x \in S, \ i = 1, 2,$$

$$\pi^i(x, a^i) \ge 0 \quad \forall a^i \in A^i(x) \setminus \{\hat{a}^i(x)\}, \ \forall x \in S, \ i = 1, 2. \qquad (8)$$

The variables $\pi^i(x, \hat{a}^i(x)) = 1 - \sum_{a^i \in A^i(x) \setminus \{\hat{a}^i(x)\}} \pi^i(x, a^i)$, $\forall x \in S$, $i = 1, 2$, are implicitly assigned in the
above set of constraints. For the sake of simplicity, in the above, the equations related to the values $v^1$ and $v^2$
are written without performing elimination of the quantities $\pi^i(x, \hat{a}^i(x))$. For further simplicity, we represent all the
inequality constraints in (8) as $g_j(v, \pi) \le 0$, $j = 1, 2, \dots, n$, where $n$ is the total number of constraints.

2.4 Theoretical Results on the Optimization Problem 

The optimization problem described above is applicable for general-sum two-agent discounted stochastic games. 
[Filar and Vrieze, 2004, Theorems 3.8.1-3.8.3] are given below as Theorems 2.3-2.5. See [Filar and Vrieze, 2004, 
pp. 130-132] for a proof of these results. 

Theorem 2.3 In a general-sum, discounted stochastic game, there exists a Nash equilibrium in stationary strategies.

Theorem 2.4 Consider a tuple $(v, \pi)$. The strategy $\pi$ forms a Nash equilibrium for the general-sum discounted
game if and only if $(v, \pi)$ is the global minimum of the optimization problem with $f(v, \pi) = 0$.

Thus, the optimization problem defined in (7) has at least one global optimum having value zero which corresponds
to the Nash equilibrium for the stochastic game.

Theorem 2.5 Let $(v, \pi)$ be a feasible point for (7) with an objective function value $\gamma > 0$. Then $\pi$ forms an
$\epsilon$-Nash equilibrium with $\epsilon \le \frac{\gamma}{1 - \beta}$.

The above result, in simple terms, says that being in a small neighbourhood of a global optimal point of the
optimization problem (7) corresponds to being in a small neighbourhood of the corresponding Nash equilibrium.
Thus, there is a correspondence between global optima and Nash equilibria, which makes this an important result from
the point of view of numerical convergence behaviour.

3 A Gradient Descent Scheme 

The optimization problem (7) for two-player general-sum stochastic games, has an interesting structure with only 
cross products between optimization variables appearing in both the objective function as well as the constraints. 
So, as a first naive way of handling this optimization problem, we see whether it can be broken down
into smaller problems via a uni-variate type scheme [Rao, 1996, Section 5.4, pp. 350]. It is possible to see
that the original problem can be split into two sets of linear optimization problems: (i) the first set has
two optimization problems, in $v^1$ and $v^2$ separately, with $\pi$ held constant; and (ii) the second set has one problem in
$(\pi^1(x), \pi^2(x))$ for every possible state $x \in S$, with $v$ held constant. Thus, with a uni-variate
type of break down of the original problem, we get several smaller problems that can be easily solved. However, a
major drawback of this approach is the inherent deficiency of uni-variate methods, which do not have guaranteed
convergence in general. In fact, we observed in simulations and also through numerical calculations that this
approach does indeed fail because of the above mentioned deficiency. Hence, we look at devising a non-linear
programming approach. The algorithm to be discussed is mainly based on an interior-point search algorithm by 
Herskovits [1986]. 

We first discuss the difficulties posed by the optimization problem (7). We try to address these issues in 
the subsequent sections by presenting a suitable gradient-based algorithm. With a suitable initial feasible point, 
the iterative procedure of Herskovits [1986] converges to a constrained local minimum of a given optimization 
problem. The unmodified algorithm of Herskovits [1986] is presented in Section 3.2. Section 3.3 discusses a
scheme for finding an initial feasible point. Exploiting the knowledge about the functional forms of the objective
and constraints, we present in Section 3.5 our modification of the step-length selection procedure in the Herskovits
algorithm. Finally, the modified algorithm in full is provided in Section 3.6.

3.1 Difficulties 

We note that the optimization problem (7) presents the following difficulties. 

1. Dimensionality - The numbers of variables and constraints involved in the optimization problem are large.
For the two-agent scenario, the number of variables can be shown to be twice the sum of the cardinalities
of the state and action spaces. For instance, in the terrain exploration problem discussed in Section 4,
for a simple 4 x 4 grid terrain with two agents and two objects, the number of variables is (647 x 2) +
(4169 x 2) = 9632. The total number of inequality constraints for the same can also be computed to be
(4169 x 2) + (4169 x 2) = 16676.

2. Non-convexity - The constraint region in the optimization problem is not necessarily convex. In fact, during
simulations related to the terrain exploration problem (Section 4), we observed that the convexity condition does not
hold for many constraints. So, in general, the optimization problem (7) has a non-convex feasible region.

3. Issue with steepest descent - As explained in Section 2.2, the objective function in the optimization problem 
(7) is obtained by averaging over strategies, the inequality constraint sets (7(a)) and (7(b)) respectively. This 
has an effect on the steepest descent gradient directions at the constraint boundaries. The steepest descent 
direction has been found to be always opposing the constraint boundaries. As a result, a gradient method 
with the steepest descent direction as its search direction will get stuck when it hits a constraint boundary. 

3.2 The Herskovits Algorithm 

We observed that in the optimization problem (7), steepest descent directions most often oppose the active constraint
boundaries. Hence a steepest descent direction cannot be used as it would get stuck at one such boundary
point which may not be an optimal point. Herskovits' method offers two features which address this issue: (1) The
search direction selected at each iteration, while being a strict descent direction, makes use of the knowledge of
the gradients of constraints as well as the gradient of the objective; and (2) the procedure is strictly feasible, i.e., at
any iteration, the current best feasible point is not touching any constraint boundary.

Assumptions: 

The assumptions required for the two-stage method are as follows: 

(i) The feasible region $\Omega$ has an interior $\Omega^\circ$ and is equal to the closure of $\Omega^\circ$, i.e., $\Omega = \overline{\Omega^\circ}$.

(ii) Each $(v, \pi) \in \Omega^\circ$ satisfies $g_i(v, \pi) < 0$, $i = 1, 2, \dots, n$.

(iii) There exists a real number $a$ such that the level set $\Omega_a = \{(v, \pi) \in \Omega \mid f(v, \pi) \le a\}$ is compact and has an
interior.

(iv) The function $f$ is continuously differentiable and $g_j$, $j = 1, 2, \dots, n$, is twice continuously differentiable in
$\Omega_a$.

(v) At every $(v, \pi) \in \Omega_a$, the gradients of active constraints form an independent set of vectors.

It can be seen that assumptions (i)-(iv) are easily verified considering the functional forms of the objective
and constraints of the optimization problem (7) and that the state space $S$ and the action space $A$ are assumed to be
finite. Assumption (v) is carried over as it is. We present the algorithm of Herskovits [1986] in two parts: First,
we provide the two-stage feasible direction method in Algorithm 1 and then in Algorithm 2, we present the full
algorithm.

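Algorithm 1. Two-stage Feasible Direction Method
Parameter: $\alpha \in (0, 1)$, $\rho_0 > 0$
Parameter: $w_j(v_0, \pi_0) > 0$, $j = 1, 2, \dots, n$, continuous functions
Input: $\nabla f(v_0, \pi_0)$, $\nabla g_j(v_0, \pi_0)$, $j = 1, 2, \dots, n$
Output: $S$, a feasible direction

1. Set $\rho \leftarrow \rho_0$.

2. Compute $\gamma_0 \in \mathbb{R}^n$ and the vector $S_0$ (of the same dimension as $(v, \pi)$) by solving the linear system

$$S_0 = -\left[ \nabla f(v_0, \pi_0) + \sum_{j=1}^{n} \gamma_{0j} \nabla g_j(v_0, \pi_0) \right],$$

$$S_0^T \nabla g_j(v_0, \pi_0) = -w_j(v_0, \pi_0) \gamma_{0j} g_j(v_0, \pi_0), \quad j = 1, 2, \dots, n. \qquad (9)$$

3. Stop and output $S \leftarrow 0$ if $S_0 = 0$.

4. Compute $\rho_1 = \frac{1 - \alpha}{\sum_{i=1}^{n} \gamma_{0i}}$ if $\sum_{i=1}^{n} \gamma_{0i} > 0$. Also, $\rho \leftarrow \frac{\rho_1}{2}$ if $\rho_1 < \rho$.

5. Compute $\gamma \in \mathbb{R}^n$ and the vector $S$ by solving the linear system

$$S = -\left[ \nabla f(v_0, \pi_0) + \sum_{j=1}^{n} \gamma_j \nabla g_j(v_0, \pi_0) \right],$$

$$S^T \nabla g_j(v_0, \pi_0) = -\left[ w_j(v_0, \pi_0) \gamma_j g_j(v_0, \pi_0) + \rho \|S_0\|^2 \right], \quad j = 1, 2, \dots, n, \qquad (10)$$

where $\|S_0\|$ is the Euclidean norm of the direction vector $S_0$.

6. Output $S$.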

This method computes a feasible direction in two stages. In the first stage (equation (9)), it computes a descent
direction $S_0$. By using its squared norm as a factor, a feasible direction $S$ is computed in the second stage (equation
(10)). Note that the second stage ensures that all the active constraints have $S^T \nabla g_j(v_0, \pi_0) = -\rho \|S_0\|^2$, where
the right-hand side is strictly negative. Thus, gradients of active constraints are maintained at obtuse angles with
the direction $S$ and hence, the vector $S$ points away from the active constraint boundaries. Thus, feasibility of the
direction $S$ is ensured. Let $S = (S_v^1, S_v^2, S_\pi^1, S_\pi^2)$ be a descent search direction where $S_v^i$ (resp. $S_\pi^i$) is the search
direction in $v^i$ (resp. $\pi^i$) for $i = 1, 2$. Also, let $S_v = (S_v^1, S_v^2)$ and $S_\pi = (S_\pi^1, S_\pi^2)$. We now present the original
Herskovits algorithm as Algorithm 2.
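The two stages can be sketched in code by eliminating the direction from each linear system, which leaves an n x n system in the multipliers alone: substituting the first line of (9) or (10) into the second gives $(G^T G - \mathrm{diag}(w_j g_j)) \gamma = \text{rhs}$, where the columns of $G$ are the constraint gradients. The code below is a minimal sketch under assumptions, not the authors' implementation: the function names, the default parameter values, the plain Gaussian elimination helper, and the two-variable test problem are all hypothetical.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda k: abs(M[k][c]))
        M[c], M[piv] = M[piv], M[c]
        for k in range(c + 1, n):
            factor = M[k][c] / M[c][c]
            for j in range(c, n + 1):
                M[k][j] -= factor * M[c][j]
    z = [0.0] * n
    for c in reversed(range(n)):
        z[c] = (M[c][n] - sum(M[c][j] * z[j] for j in range(c + 1, n))) / M[c][c]
    return z


def two_stage_direction(grad_f, grads_g, g, w, alpha=0.5, rho0=1.0):
    """Sketch of Algorithm 1 at a strictly feasible point (all g[j] < 0)."""
    n, d = len(g), len(grad_f)

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    def direction(gam):
        # S = -(grad f + sum_j gamma_j grad g_j)
        return [-(grad_f[c] + sum(gam[j] * grads_g[j][c] for j in range(n)))
                for c in range(d)]

    # Matrix G^T G - diag(w_j g_j), shared by both stages.
    A = [[dot(grads_g[j], grads_g[k]) - (w[j] * g[j] if j == k else 0.0)
          for k in range(n)] for j in range(n)]
    base = [-dot(grads_g[j], grad_f) for j in range(n)]
    gamma0 = solve(A, base)                      # stage 1, system (9)
    S0 = direction(gamma0)
    norm2 = dot(S0, S0)
    if norm2 == 0.0:
        return S0, gamma0                        # step 3: no descent direction
    rho = rho0
    total = sum(gamma0)
    if total > 0:                                # step 4
        rho1 = (1.0 - alpha) / total
        if rho1 < rho:
            rho = rho1 / 2.0
    gamma = solve(A, [bj + rho * norm2 for bj in base])   # stage 2, system (10)
    return direction(gamma), gamma


# Hypothetical test problem: minimize z1 + z2 subject to -z1 <= 0, -z2 <= 0, at (1, 1).
S, gamma = two_stage_direction(
    grad_f=[1.0, 1.0],
    grads_g=[[-1.0, 0.0], [0.0, -1.0]],
    g=[-1.0, -1.0],
    w=[1.0, 1.0],
)
```

For this toy problem the returned direction points towards the origin and has a negative inner product with the objective gradient, so it is a descent direction. The dense elimination is only for illustration; Section 3.4 of the paper discusses exploiting sparsity instead.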

Algorithm 2. The Herskovits Algorithm
Parameter: $\nu > 1$, $\delta_0 \in (0,1)$, $\eta \in (0,1)$
Input: $(v_0, \pi_0)$: initial feasible point which is a strict interior point.
Output: $(v^*, \pi^*)$

1. iteration $\leftarrow$ 1.

2. $(v, \pi) \leftarrow (v_0, \pi_0)$

begin loop

3. Compute feasible direction $S$ using the two-stage feasible direction method (Algorithm 1).

4. Stop algorithm if $S = 0$. Set $(v^*, \pi^*) \leftarrow (v, \pi)$. Output $(v^*, \pi^*)$.

5. Let $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_n)$ be the Lagrange multiplier vector computed in Algorithm 1. Define $\delta_j = \delta_0$ if $\gamma_j \ge 0$ and $\delta_j = 1$ if $\gamma_j < 0$.

6. Find $t$, the first element in the sequence $\{1, 1/\nu, 1/\nu^2, \ldots\}$ such that

$$
\begin{aligned}
f(v + tS_v, \pi + tS_\pi) &\le f(v,\pi) + t\eta\, S^T \nabla f(v,\pi), \text{ and} \\
g_j(v + tS_v, \pi + tS_\pi) &\le \delta_j\, g_j(v,\pi), \quad \forall j = 1,2,\ldots,n.
\end{aligned}
\tag{11}
$$

7. Stop algorithm if $t = 0$. Set $(v^*, \pi^*) \leftarrow (v, \pi)$. Output $(v^*, \pi^*)$.

8. $(v, \pi) \leftarrow (v, \pi) + tS$

9. iteration $\leftarrow$ iteration + 1.

end loop
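
The line search of step 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name, the flat-list representation of the point $(v,\pi)$, the iteration cap, and the default values of `nu` and `eta` are all hypothetical choices made for the example. The argument `slope` stands for $S^T \nabla f(v,\pi)$ and is supplied by the caller.

```python
def herskovits_line_search(f, constraints, x, S, slope, deltas,
                           nu=2.0, eta=0.1, max_steps=60):
    """Return the first t in {1, 1/nu, 1/nu^2, ...} that passes both tests.

    f           -- objective, maps a list of floats to a float
    constraints -- list of functions g_j with g_j(x) <= 0 at feasible points
    slope       -- directional derivative S^T grad f at x (negative for descent)
    deltas      -- one delta_j in (0, 1] per constraint
    Returns 0.0 if no trial step is accepted within max_steps reductions.
    """
    f0 = f(x)
    g0 = [g(x) for g in constraints]
    t = 1.0
    for _ in range(max_steps):
        trial = [xi + t * si for xi, si in zip(x, S)]
        # Sufficient-decrease test on the objective.
        descent_ok = f(trial) <= f0 + t * eta * slope
        # Each constraint must stay at or below delta_j times its current value.
        feasible_ok = all(g(trial) <= d * gj
                          for g, d, gj in zip(constraints, deltas, g0))
        if descent_ok and feasible_ok:
            return t
        t /= nu
    return 0.0
```

Returning 0.0 when no step is accepted mirrors the stopping test on $t$ in step 7; in floating-point arithmetic that test would normally be applied with a small tolerance.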

This algorithm can be tuned by utilizing the knowledge about the structure of the optimization problem (7). 
We present the following modifications in this direction: 

1. Computing the initial feasible point by a set of simple linear programs in Section 3.3; 

2. Exploiting the sparsity of the matrix involved in computing the two-stage feasible direction in Section 3.4; 
and 

3. Knowing the cubic form of the objective and the quadratic form of the constraints, and computing an optimal
step-length in Section 3.5. However, we keep condition (11) satisfied while selecting the step-length.


3.3 Initial Feasible Point 


The optimization problem given in (7) has a distinct separation between strategy probability terms and value vector terms. This can be exploited to find an initial feasible solution using the following procedure. First, a feasible strategy is selected, for instance, a uniform strategy, $\pi_0 = (\pi_0^i : i = 1,2)$ with

$$\pi_0^i(x,a) = \frac{1}{m^i(x)}, \quad \forall a \in A^i(x),\ x \in S,\ i = 1,2. \tag{12}$$

If this strategy is held constant, then it is easy to see that the main optimization problem (7) breaks down into two linear programming problems in $v^1$ and $v^2$, respectively, as given in (13). For the Herskovits algorithm, a strict interior point is desired to start with. That is, the initial point for the algorithm needs to be strictly away from all constraint boundaries. So, we introduce a small positive parameter, $\alpha > 0$, in the left-hand side of the constraints given in (13).


$$
\begin{aligned}
&\min_{v^1}\ \left\{ 1_{|S|}^T \left( v^1 - r^1(\pi_0) - \beta P(\pi_0) v^1 \right) \right\} \\
&\quad \text{s.t.}\quad \Big[ r^1(x) + \beta \sum_{y \in U(x)} P(y|x)\, v^1(y) \Big] \pi_0^2(x) + \alpha \le v^1(x)\, 1_{m^1(x)}, \quad \forall x \in S, \\
&\min_{v^2}\ \left\{ 1_{|S|}^T \left( v^2 - r^2(\pi_0) - \beta P(\pi_0) v^2 \right) \right\} \\
&\quad \text{s.t.}\quad \Big[ r^2(x) + \beta \sum_{y \in U(x)} P(y|x)\, v^2(y) \Big]^T \pi_0^1(x) + \alpha \le v^2(x)\, 1_{m^2(x)}, \quad \forall x \in S.
\end{aligned}
\tag{13}
$$

The two optimization problems in (13) can be solved readily with the popular revised simplex method [Chvatal, 1983, Chapter 7]. Since the main purpose here is just to get an initial feasible point, the first phase of the revised simplex method, in which the auxiliary problem in an artificial variable is solved, is in itself sufficient. The relevant details of the first phase of the revised simplex method are described in [Chvatal, 1983, Chapter 7]. An initial feasible point $(v_0, \pi_0)$ can thus be obtained.

3.4 Sparsity 

The two-stage feasible direction method given in Algorithm 1 requires inverting a matrix of dimension the same as the number of constraints. Note that the number of constraints in the optimization problem is large (see Section 3.1). Thus, in principle, our method would require a large memory and computational effort. However, we observe that the matrix to be inverted is sparse. Hence, we use efficient techniques for sparse matrices that result in a substantial reduction in the computational requirements. The matrix to be inverted is given by

$$
H = \begin{bmatrix}
\nabla g_1^T \nabla g_1 - \omega_1 g_1 & \nabla g_1^T \nabla g_2 & \cdots & \nabla g_1^T \nabla g_n \\
\nabla g_2^T \nabla g_1 & \nabla g_2^T \nabla g_2 - \omega_2 g_2 & \cdots & \nabla g_2^T \nabla g_n \\
\vdots & \vdots & \ddots & \vdots \\
\nabla g_n^T \nabla g_1 & \nabla g_n^T \nabla g_2 & \cdots & \nabla g_n^T \nabla g_n - \omega_n g_n
\end{bmatrix}.
\tag{14}
$$

The elements of $H$ are mainly dot products of constraint gradients. In a typical stochastic game, the number of states related by non-zero transition probabilities is small compared to the total number of states. The same applies to the action sets: the action set available at each state usually overlaps little with the action sets of other states, and in some cases these sets may be completely disjoint. For the simple terrain exploration problem discussed in Section 4, the above matrix for a $4 \times 4$ grid scenario with two objects and two agents is of size $16676 \times 16676$ and is only about 4% full.

We note that the two-stage feasible direction method does not really require an explicit inverse of the matrix. Rather, it requires the solution of the linear system of equations $H\gamma = b$, where $b$ is a vector of appropriate dimension. For this purpose, we use a decomposition technique in which the matrix is decomposed as $H = LDL^T$, where $L$ is a lower triangular matrix and $D$ is a diagonal matrix. Since $H$ is also sparse, we perform a sparse $LDL^T$ decomposition of $H$ using the techniques discussed by Davis et al. [2007], Davis [2007], for which software is publicly available on the internet. Once $H$ is decomposed, the solution $\gamma$ can be easily computed.
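
To make the decompose-then-solve idea concrete, the following Python sketch factors a small symmetric matrix as $LDL^T$ and solves $H\gamma = b$ by forward substitution, diagonal scaling and back substitution. It is only an illustration under simplifying assumptions: it uses dense lists, performs no pivoting (every pivot $D_{jj}$ is assumed non-zero), and does not exploit sparsity, whereas the text above relies on dedicated sparse software. The function names are hypothetical.

```python
def ldl_decompose(H):
    """Factor a symmetric matrix H (list of rows) as L * diag(D) * L^T.

    L is unit lower triangular. No pivoting is done, so every pivot D[j]
    is assumed to be non-zero.
    """
    n = len(H)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = H[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            acc = sum(L[i][k] * L[j][k] * D[k] for k in range(j))
            L[i][j] = (H[i][j] - acc) / D[j]
    return L, D


def ldl_solve(L, D, b):
    """Solve (L diag(D) L^T) gamma = b using the factors from ldl_decompose."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):                      # forward substitution: L z = b
        z[i] = b[i] - sum(L[i][k] * z[k] for k in range(i))
    y = [z[i] / D[i] for i in range(n)]     # diagonal scaling: D y = z
    gamma = [0.0] * n
    for i in reversed(range(n)):            # back substitution: L^T gamma = y
        gamma[i] = y[i] - sum(L[k][i] * gamma[k] for k in range(i + 1, n))
    return gamma
```

Because the factors depend only on $H$, the same `L` and `D` can be reused for the two right-hand sides that arise in the two stages of Algorithm 1 whenever the matrix is unchanged between them.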

3.5 Computing the Optimal Step-length 

The objective function has been shown previously to be cubic and the constraints quadratic in the optimization variables. This structure can be exploited to find the optimal step length, $t^*$, in any chosen direction.

3.5.1 Optimal Step Length, t*

Let $(v_0, \pi_0)$ be the current point and $(v, \pi)$ be the next point, obtained from the previous one by moving one step along the descent direction. Thus, $v = v_0 + tS_v$ and $\pi = \pi_0 + tS_\pi$. Upon substitution into the objective function $f(v,\pi)$, one obtains

$$f(v,\pi) = f(v_0 + tS_v, \pi_0 + tS_\pi) = d_0 + d_1 t + d_2 t^2 + d_3 t^3, \tag{15}$$

where

$$
\begin{aligned}
d_0 &= \sum_{i=1}^{2} 1_{|S|}^T \left\{ v_0^i - r^i(\pi_0) - \beta P(\pi_0) v_0^i \right\}, \\
d_1 &= \sum_{i=1}^{2} 1_{|S|}^T \Big\{ S_v^i - \big( r^i((\pi_0^1, S_\pi^2)) + r^i((S_\pi^1, \pi_0^2)) \big) \\
&\qquad\qquad - \beta \big( P(\pi_0) S_v^i + P((\pi_0^1, S_\pi^2)) v_0^i + P((S_\pi^1, \pi_0^2)) v_0^i \big) \Big\}, \\
d_2 &= \sum_{i=1}^{2} 1_{|S|}^T \Big\{ -r^i(S_\pi) - \beta \big( P((\pi_0^1, S_\pi^2)) S_v^i + P((S_\pi^1, \pi_0^2)) S_v^i + P(S_\pi) v_0^i \big) \Big\}, \\
d_3 &= \sum_{i=1}^{2} 1_{|S|}^T \left\{ -\beta P(S_\pi) S_v^i \right\}.
\end{aligned}
\tag{16}
$$

In the above equations, the search direction terms $S_\pi^1$ and $S_\pi^2$ have been used in places where strategy terms are expected. Note that the search direction terms $S_\pi^1$ and $S_\pi^2$ do not form strategies. The usage here is purely in the functional sense.

Now, from $\dfrac{df}{dt} = 0$, one obtains $d_1 + 2d_2 t + 3d_3 t^2 = 0$. Hence the extreme points of $f(v,\pi)$ are given by

$$t^* = \frac{-d_2 \pm \sqrt{d_2^2 - 3d_1 d_3}}{3d_3}.$$

(i) If $d_2^2 - 3d_1 d_3 < 0$, then with increasing $t$, the function value decreases monotonically in the chosen direction $S$. So, any value $t > 0$ is fine without considering the constraints (see Figure 1). (ii) If $d_2^2 - 3d_1 d_3 > 0$, we get two extreme points in the chosen direction. If any of these points has a negative $t$ value, it is ignored. Since the direction is known to be a descent direction, if one extreme point is negative, then so will be the other extreme point as well (see Figure 1). So far, only the objective function has been considered. The approach to handle the constraints is explained next.
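
The computation of the extreme points can be sketched in Python as below. This is an illustrative helper, not code from the paper: the function name is hypothetical, and the degenerate case $d_3 = 0$ (where the cubic reduces to a quadratic) is handled separately as an added assumption. Each root is tested for positivity on its own, so the helper makes no assumption about the sign pattern of the two roots.

```python
import math


def cubic_stationary_points(d1, d2, d3):
    """Positive real roots of d1 + 2*d2*t + 3*d3*t**2 = 0, in ascending order.

    These are the stationary points of d0 + d1*t + d2*t**2 + d3*t**3 that lie
    ahead of the current point along the search direction.
    """
    if d3 == 0.0:
        # The derivative is linear (or constant) in t.
        if d2 == 0.0:
            return []
        t = -d1 / (2.0 * d2)
        return [t] if t > 0.0 else []
    disc = d2 * d2 - 3.0 * d1 * d3
    if disc < 0.0:
        return []  # no stationary point: the cubic is monotone in t
    root = math.sqrt(disc)
    candidates = [(-d2 + root) / (3.0 * d3), (-d2 - root) / (3.0 * d3)]
    return sorted(t for t in candidates if t > 0.0)
```

For example, with $d_1 = -9$, $d_2 = 6$, $d_3 = -1$ the derivative is $-3(t-1)(t-3)$, so the helper returns the two points $t = 1$ (a minimum, since the second derivative $12 - 6t$ is positive there) and $t = 3$ (a maximum).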


3.5.2 Constraints on step-length, t 

The constraints in the optimization problem (7) impose limits on the possible values that $t$ can take. Consider the inequality constraint (7(a)) for a particular $a^1 \in A^1(x)$. Let $g_j(\cdot) \le 0$ represent one of these constraints and let $\delta_j$ represent the corresponding parameter computed in step 5 of the Herskovits algorithm (Algorithm 2). Apart from feasibility of the step size, we wish to ensure that condition (11), i.e., $g_j(v,\pi) \le \delta_j g_j(v_0,\pi_0)$, holds as well. Now, substituting $v = v_0 + tS_v$ and $\pi = \pi_0 + tS_\pi$, and rearranging, we get

$$b^1(x,a^1) + c^1(x,a^1)\,t + d^1(x,a^1)\,t^2 \le 0, \tag{17}$$

where

$$
\begin{aligned}
b^1(x,a^1) &= (1 - \delta_j)\, g_j(v_0,\pi_0), \\
c^1(x,a^1) &= \pi_0^2(x)^T \beta \sum_{y \in U(x)} P(y|x,a^1)\, S_v^1(y) + S_\pi^2(x)^T \Big[ r^1(x,a^1) + \beta \sum_{y \in U(x)} P(y|x,a^1)\, v_0^1(y) \Big] - S_v^1(x), \\
d^1(x,a^1) &= S_\pi^2(x)^T \beta \sum_{y \in U(x)} P(y|x,a^1)\, S_v^1(y),
\end{aligned}
\tag{18}
$$

respectively. Let $\mathcal{D}(x,a^1) = c^1(x,a^1)c^1(x,a^1) - 4b^1(x,a^1)d^1(x,a^1)$. Consider the case where $\mathcal{D} < 0$. This implies that the quadratic does not intersect the $t$-axis at any point, i.e., it lies fully above the $t$-axis or fully below it. Clearly, $d^1(x,a^1) < 0$ implies that the quadratic lies fully below the $t$-axis, and vice versa for $d^1(x,a^1) > 0$.

Proposition 3.6 If $\mathcal{D}(x,a^1) < 0$, then $d^1(x,a^1) < 0$.

Proof: Suppose this is not true. Then $d^1(x,a^1) \ge 0$. Hence, $b^1(x,a^1)d^1(x,a^1) \le 0$ because, by definition of $\delta_j$ and $g_j(v_0,\pi_0)$, we have that $b^1(x,a^1) \le 0$. Thus, we obtain $c^1(x,a^1)c^1(x,a^1) - 4b^1(x,a^1)d^1(x,a^1) \ge 0$, which is a contradiction. Hence, $d^1(x,a^1) < 0$. ■

Thus, for the case when $\mathcal{D}(x,a^1) < 0$, any value of $t > 0$ is fine as the quadratic is fully below the $t$-axis. Now consider the case where $\mathcal{D}(x,a^1) \ge 0$. On solving the quadratic function, we get its two roots as

$$t_1(x,a^1) = \frac{-c^1(x,a^1) + \sqrt{\mathcal{D}(x,a^1)}}{2d^1(x,a^1)} \quad \text{and} \quad t_2(x,a^1) = \frac{-c^1(x,a^1) - \sqrt{\mathcal{D}(x,a^1)}}{2d^1(x,a^1)}, \tag{19}$$

$\forall a^1 \in A^1(x)$, $\forall x \in S$. If $t_1(x,a^1) > t_2(x,a^1)$, it can be shown that the region allowed by this constraint is given by the interval $[t_2(x,a^1), t_1(x,a^1)]$ in the given direction $S$. Otherwise, for $t_1(x,a^1) < t_2(x,a^1)$, the region allowed by this constraint on the real line is given by $(-\infty, t_1(x,a^1)] \cup [t_2(x,a^1), \infty)$. Note that this implies that the constraint is not convex. The above explanation can be easily adapted for constraints on the second agent as well. Thus, feasible value ranges for $t$ imposed by the constraints (7(a)) and (7(b)) can be obtained. We formalize in Algorithm 3 this process of computing feasible value ranges for $t$ imposed by quadratic constraints (17).

Algorithm 3. Feasible x from a Quadratic Constraint
Input: $b$, $c$, $d$ - coefficients of the quadratic constraint $b + cx + dx^2 \le 0$.
Output: $L$ - feasible set for $x$

if $d = 0$ then
    if $c = 0$ then
        if $b > 0$ then
            $L = \emptyset$
        else
            $L = \Re$
        end if
    else if $c > 0$ then
        $L = (-\infty, -b/c]$
    else
        $L = [-b/c, \infty)$
    end if
else
    $\mathcal{D} = c^2 - 4bd$
    if $\mathcal{D} < 0$ then
        if $d > 0$ then
            $L = \emptyset$
        else
            $L = \Re$
        end if
    else
        $x_1 = \dfrac{-c + \sqrt{\mathcal{D}}}{2d}$
        $x_2 = \dfrac{-c - \sqrt{\mathcal{D}}}{2d}$
        if $x_2 < x_1$ then
            {Upper limit, $x \le x_1$}
            {Lower limit, $x \ge x_2$}
            $L = [x_2, x_1]$
        else
            $L = (-\infty, x_1] \cup [x_2, \infty)$
        end if
    end if
end if
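
A possible Python rendering of Algorithm 3 is sketched below. The representation of the feasible set as a list of closed intervals `(lo, hi)` and the function name are assumptions made for this illustration. In the linear case the sketch uses the fact that $b + cx \le 0$ with $c > 0$ is equivalent to $x \le -b/c$, and it branches on the sign of $d$ rather than on the order of the roots so that a repeated root is also handled.

```python
import math

INF = math.inf


def quadratic_feasible(b, c, d):
    """Set of x with b + c*x + d*x**2 <= 0, as a list of closed intervals.

    An empty list means no x is feasible; [(-INF, INF)] means every x is.
    """
    if d == 0:
        if c == 0:
            return [] if b > 0 else [(-INF, INF)]
        bound = -b / c
        # c > 0 gives x <= -b/c, c < 0 gives x >= -b/c.
        return [(-INF, bound)] if c > 0 else [(bound, INF)]
    disc = c * c - 4 * b * d
    if disc < 0:
        # No real roots: the parabola is entirely above (d > 0) or below (d < 0).
        return [] if d > 0 else [(-INF, INF)]
    root = math.sqrt(disc)
    x1 = (-c + root) / (2 * d)
    x2 = (-c - root) / (2 * d)
    if d > 0:
        return [(x2, x1)]              # feasible between the roots
    return [(-INF, x1), (x2, INF)]     # feasible outside the roots
```

The two-interval result in the last line reflects the non-convexity noted above: when $d < 0$, the feasible set for a single constraint may be a union of two disjoint rays.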
Equality constraints (7(c)) and (7(d)) on the values of $\pi$ are $1_{m^i(x)}^T \pi^i(x) = 1$, $\forall x \in S$, $i = 1,2$. On using $\pi = \pi_0 + tS_\pi$ and $1_{m^i(x)}^T \pi_0^i(x) = 1$, $\forall x \in S$, we get

$$t \times 1_{m^i(x)}^T S_\pi^i(x) = 0, \quad \forall x \in S,$$

which do not impose any condition on the value of $t$. However, the set of inequality constraints (7(e)) and (7(f)), for non-negativity of strategy vectors, i.e., $\pi^i(x,a^j) \ge 0$, $\forall a^j \in A^i(x)$, $x \in S$, $i = 1,2$, may impose an upper limit on the value of $t$. For example, consider $\pi^i(x,a^j) \ge 0$. Let $\pi_0^i(x,a^j)$ represent the current best value of the probability of picking action $a^j$ in state $x$ by agent $i$, and let $\delta_j$ represent the corresponding parameter computed in step 5 of the Herskovits algorithm (see Algorithm 2). We wish to satisfy condition (11), i.e., $\pi^i(x,a^j) \ge \delta_j \pi_0^i(x,a^j)$. Upon substituting $\pi^i(x,a^j) = \pi_0^i(x,a^j) + tS_\pi^i(x,a^j)$, we get $\pi_0^i(x,a^j) + tS_\pi^i(x,a^j) \ge \delta_j \pi_0^i(x,a^j)$. If $S_\pi^i(x,a^j) < 0$, then

$$t \le \frac{\pi_0^i(x,a^j)(1-\delta_j)}{-S_\pi^i(x,a^j)} \quad \text{or} \quad t \in \left(-\infty,\ \frac{\pi_0^i(x,a^j)(1-\delta_j)}{-S_\pi^i(x,a^j)}\right].$$

Note that $t > 0$ is implicitly assumed, else while $S$ is a descent direction, $tS$ would not be one. Thus, if $S_\pi^i(x,a^j) > 0$, we get $t \ge \dfrac{\pi_0^i(x,a^j)(1-\delta_j)}{-S_\pi^i(x,a^j)}$, whose right-hand side is non-positive; this does not impose any additional constraint on $t$ and hence can be ignored.

Intersection of the feasible regions given by all constraints in (7(a)), (7(b)), (7(e)) and (7(f)) gives the feasible set of values for $t$, from which a suitable step length $t^* > 0$ is selected. The procedure for doing so is explained next.
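
The upper limit on $t$ arising from the non-negativity constraints amounts to a ratio test over the decreasing strategy components. A minimal Python sketch follows; the function name and the flat-list layout of the strategy components, their search directions and their $\delta$ parameters are assumptions made for the illustration.

```python
def max_step_for_nonnegativity(pi0, s_pi, deltas):
    """Largest t with pi0[k] + t*s_pi[k] >= deltas[k]*pi0[k] for every k.

    Only components with a negative search direction bound t from above;
    float('inf') is returned when no component decreases.
    """
    t_max = float("inf")
    for p, s, delta in zip(pi0, s_pi, deltas):
        if s < 0.0:
            t_max = min(t_max, p * (1.0 - delta) / (-s))
    return t_max
```

With two actions at probability 0.5 each, directions $-0.25$ and $+0.25$, and $\delta = 0.5$ for both, the bound is $0.5 \times 0.5 / 0.25 = 1$.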


3.5.3 Selection of the optimal step length, t* 

Along the chosen descent direction $S$, the objective function $f(v,\pi)$ is a cubic function of the step length $t$. If the extreme points are real and both positive, then the first such point in the descent direction is a minimum point and the next a maximum point, as shown in Figure 1. Under this condition, the best point is obtained by finding the best among the minimum point (or the two feasible points near the minimum point) and the maximum step length point, which is decided by the constraints. Otherwise, the cubic curve is like the dashed curve in Figure 1. In such a case, the optimal step length is simply the maximum feasible step length.


3.5.4 Optimal Step Length Algorithm 


Algorithm 4. Step Length Calculation
Parameter: $\beta$: discount factor
Input: $(v_0, \pi_0)$: current value-strategy pair
Input: $S$: selected descent direction
Output: $t$: the best step length

1. Calculate $d_1$, $d_2$ and $d_3$ using (16).

2. $(t_1, t_2) \leftarrow \mathrm{roots}(d_1, 2d_2, 3d_3)$

3. $F \leftarrow \Re_+$, the set of all non-negative real numbers

for $x \in S$, $a^1 \in A^1(x)$, $a^2 \in A^2(x)$, $i = 1,2$ do

4. $F \leftarrow F \cap \mathrm{quadraticfeasible}(b^i(x,a^i), c^i(x,a^i), d^i(x,a^i))$, where $b^i(x,a^i)$, $c^i(x,a^i)$, $d^i(x,a^i)$ are from (18).

5. $F \leftarrow F \cap \left(-\infty,\ \dfrac{\pi_0^i(x,a^i)(1-\delta_j)}{-S_\pi^i(x,a^i)}\right]$ if $S_\pi^i(x,a^i) < 0$.

end for

6. If the extreme points $t_1$ and $t_2$ are real and both positive, the best step $t$ is obtained by finding the best amongst the minimum point $t_1$ (or two feasible points in $F$ near the minimum point) and the maximum step in $F$. Otherwise, the best step $t$ is the maximum step in $F$.

In the above, roots(a, b, c) gives the roots of $a + bx + cx^2 = 0$. Algorithm 3 is referred to as quadraticfeasible(a, b, c).

Figure 1: Cubic Curves (See Section 3.5.2 for details)


3.6 The Complete Algorithm 

With the schemes discussed in Sections 3.3, 3.4 and 3.5, we present the modified Herskovits algorithm. 

Algorithm 5. The Complete Algorithm
Parameter: $\beta$: discount factor
Input: $\pi_0$: initial strategy (from (12))
Output: $(v^*, \pi^*)$: an $\epsilon$-Nash equilibrium with $\epsilon = \dfrac{f(v^*,\pi^*)}{1-\beta}$

1. iteration $\leftarrow$ 1.

2. $\pi \leftarrow \pi_0$.

3. Compute $v$ from the linear programs in (13) using only the first phase of the revised simplex method (see Section 3.3).

begin loop

4. Compute feasible direction $S$ using the two-stage feasible direction method (Algorithm 1).

5. Stop algorithm if $S = 0$. $(v^*, \pi^*) \leftarrow (v, \pi)$. Output $(v^*, \pi^*)$ and $\epsilon = \dfrac{f(v^*,\pi^*)}{1-\beta}$. Terminate the algorithm.

6. Compute the constrained optimal step length $t$ by the procedure described in Section 3.5.

7. Stop algorithm if $t = 0$. $(v^*, \pi^*) \leftarrow (v, \pi)$. Output $(v^*, \pi^*)$ and $\epsilon = \dfrac{f(v^*,\pi^*)}{1-\beta}$. Terminate the algorithm.

8. $(v, \pi) \leftarrow (v, \pi) + tS$

9. iteration $\leftarrow$ iteration + 1.

end loop

Note that in the above algorithm, equality of $S$ to zero and also that of $t$ to zero are to be considered with a small error bound around zero to handle numerical issues. The computational complexity per iteration of the algorithm is $O(|\mathcal{A}|^3)$ multiplications, contributed mainly by the steps involving formation and decomposition of the inner product matrix, $G$. However, the factor multiplying $|\mathcal{A}|^3$ can be shown to be far less than one in the actual implementation.

3.7 Convergence to a KKT point 

KKT conditions represent a set of necessary and sufficient conditions for a point to be a valid local minimum of an 
optimization problem. We write down the necessary conditions for a point (v*, tt*) to be a local minimum of the 
optimization problem (7): 

(a) $\nabla f(v^*,\pi^*) + \sum_{j=1}^{n} \lambda_j \nabla g_j(v^*,\pi^*) = 0$,

(b) $\lambda_j\, g_j(v^*,\pi^*) = 0$, $j = 1,2,\ldots,n$,

(c) $g_j(v^*,\pi^*) \le 0$, $j = 1,2,\ldots,n$,

(d) $\lambda_j \ge 0$, $j = 1,2,\ldots,n$, (20)

where $\lambda_j$, $j = 1,2,\ldots,n$, are the Lagrange multipliers associated with the constraints $g_j(v,\pi) \le 0$, $j = 1,2,\ldots,n$. Let $I = \{j \mid g_j(v,\pi) = 0\}$ be the set of active constraints. It can be easily shown that the above set of conditions is sufficient as well if the gradients of all active constraints form a linearly independent set.

We note here that the entire proof of convergence to a KKT point presented in [Herskovits, 1986, Section 3] for the unmodified Herskovits algorithm can easily be seen to apply as it is to the modified Herskovits algorithm, i.e., Algorithm 5. In the next section, we apply this algorithm to a simple terrain exploration problem, modelled as a general-sum discounted stochastic game, and observe in the simulations that the convergence is also to a Nash equilibrium. However, later in Section 5, we show that in general, convergence to a KKT point is not sufficient to guarantee convergence to a Nash equilibrium.

4 A Simple Terrain Exploration Problem 

A simplified version of the general terrain exploration problem is presented below. Consider a pair of agents that are assigned the task of collecting a set of objects located at various positions in a terrain. We assume that the object positions are known a priori. The game between the pair of agents terminates when all the objects are collected. The agent movements are considered to be stochastic. The modelling of this problem as a discounted stochastic game $(S, A, p, r, \beta)$ is described as follows.

(i) State Space, S - Let the entire terrain be discretized into a grid structure defined by $S_G = G \times G$, where $G = \{0, \pm 1, \pm 2, \ldots, \pm M\}$. The position of an agent can be represented by a point in $S_G$. Let the position of the $i$-th agent be denoted by $x^i \in S_G$, with $x^{i(1)}, x^{i(2)} \in G$ being its two coordinate components. So, the positional part of the overall state space considering the two agents is given by $S_p = S_G \times S_G$. The status regarding whether a particular object is collected or not is also a part of the state space. So, the overall state space is given by $S' = S_p \times \{0,1\}^K$, where $K$ is the total number of objects to be collected from the terrain. Let $o_i$ represent the Boolean variable for the status of the $i$-th object. Here $o_i = 0$ implies that the $i$-th object is not yet collected, and the opposite is true for $o_i = 1$. Thus, $x = (x^1, x^2, o_1, o_2, \ldots, o_K) \in S'$, where $x^i \in S_G$, $i = 1,2$. Let $B = \{y \in S_G : \text{an object is located at } y\}$. The two sets $S_1 = \{x \in S' : x^i \in B \text{ and } o_{x^i} = 0 \text{ for some } i = 1,2\}$ and $S_2 = \{x \in S' : o_j = 1\ \forall j = 1 \text{ to } K\}$ represent those combinations of states which are not feasible. Thus, the actual state space containing only feasible states is $S = [S' \setminus (S_1 \cup S_2)] \cup T$, where $T$ represents the terminal state of the game.

(ii) Action Space, A - The action space of the $i$-th agent can be defined as

$$A^i(x) = \{\text{Go to } y \in S_G : d_\infty(x^i, y) \le 1\},$$

where $x^i \in S_G$ is the position of the $i$-th agent and $d_\infty(x^i, y) = \max(|x^{i(1)} - y^{(1)}|, |x^{i(2)} - y^{(2)}|)$ is the $L^\infty$ distance metric. The aggregate action space of the two agents at state $x \in S \setminus \{T\}$ is given by $A(x) = A^1(x) \times A^2(x)$. Note that $x = (x^1, x^2, o_1, o_2, \ldots, o_K)$. Thus, the action space does not depend upon the object state except for the termination state $T$. For the termination state $T$, the only action available is to stay in the termination state. The action related to the termination state $T$ is ignored in subsequent discussions.

(iii) Transition Probability, $p(y|x,a)$ - The movements of each agent are assumed to be independent of the other agent. The transition probability $p^i(y^i|x^i,a^i)$ for the $i$-th agent is given by $p^i(y^i|x^i,a^i) = C(x^i)\, 2^{-d_1(a^i, y^i)}$, $\forall y^i \in U^i(x^i) \subseteq S_G$, $i = 1,2$, where $C(x^i) = \Big(\sum_{y \in U^i(x^i)} 2^{-d_1(a^i, y)}\Big)^{-1}$ is the normalization factor chosen to make this a probability measure and $d_1(a^i, y) = |a^{i(1)} - y^{(1)}| + |a^{i(2)} - y^{(2)}|$ is the $L^1$ norm distance between $a^i$ and $y$.

The joint transition probability is given by $p(y|x,a) = p^1(y^1|x^1,a^1)\, p^2(y^2|x^2,a^2)$.
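
The single-agent transition model can be illustrated with a few lines of Python. The sketch below rests on assumptions that are not spelled out in the text: it takes the set of reachable positions $U^i(x^i)$ to be the grid cells within $L^\infty$ distance 1 of the current position, represents positions as coordinate pairs, and uses hypothetical function names.

```python
def neighbours(pos, grid):
    """Grid cells within L-infinity distance 1 of pos, including pos itself."""
    return [(pos[0] + dx, pos[1] + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if pos[0] + dx in grid and pos[1] + dy in grid]


def transition_probs(pos, target, grid):
    """Probability of landing on each reachable cell when aiming at target.

    The weight of cell y is 2**(-d1(target, y)), with d1 the L1 distance;
    the weights are normalised so that they sum to one.
    """
    weights = {y: 2.0 ** -(abs(target[0] - y[0]) + abs(target[1] - y[1]))
               for y in neighbours(pos, grid)}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}
```

For an agent at the corner $(0,0)$ of a grid with coordinates $\{0,1,2,3\}$ that chooses "Go to $(1,1)$", the four reachable cells get weights $1/4$, $1/2$, $1/2$ and $1$, so the intended cell is reached with probability $1/2.25 = 4/9$. Under the independence assumption, the joint probability is the product of the two agents' values.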

(iv) Reward function, $r(x,a)$ - To ensure that the two agents do not get to the same position, a penalty may be imposed on the two agents when they attain the same position. Thus, the stochastic reward function for the $i$-th agent can be defined accordingly as

$$
r^i(x,a,y) = \begin{cases}
-3 & \text{if } y^i = y^j,\ j = 1,2,\ j \ne i \text{ and } o_y \ne (1,1,\ldots,1), \\
1 & \text{if object present at } y^i, \\
0 & \text{otherwise,}
\end{cases}
\tag{21}
$$

$i = 1,2$. The reward $r^i(x,a)$ is given by $r^i(x,a) = \sum_{y \in S} r^i(x,a,y)\, p(y|x,a)$.

4.1 Simulation Results 

Simulation results for $G = \{0, 1, 2, 3\}$ with two objects situated at (0,3) and (3,3) and discount factor $\beta = 0.75$ are described below. The parameters given to the two-stage feasible direction method are $\omega_j(v_0,\pi_0) = 1$, $j = 1,2,\ldots,n$, $\alpha = 0.5$ and $\rho_0 = 0.9$.

4.1.1 Objective Value 

The convergence of the objective value using Algorithm 5 to a value close to zero is shown in Figure 2. After getting an initial feasible solution, the objective value was $\approx 102.37$.

4.1.2 Strategies 

The convergence behaviour of the strategies of both agents, with the initial position of the first agent being (2,1) and that of the second being (2,0), is shown in Figures 3 and 4, respectively. The arrows in the various grids in Figures 3 and 4 signify the feasible actions in each state, and their lengths are proportional to the transition probabilities along the corresponding directions. With the given initial positions of agents and object locations, only the strategies pertaining to those positions which an agent can visit, with the other agent sticking to its own position, are plotted. Consider, for instance, Figure 3. The figure shows the strategy of the first agent with the second agent sticking to the position (2,0). At the start of the algorithm, all transition probabilities are chosen according to the uniform distribution. In Figures 3 and 4, we show the strategy profile of both the agents after the 1st, 11th and 100th iterations, and upon convergence of the algorithm. The algorithm converged in a total of 278 iterations.

Figure 2: Objective Value vs. Number of Iterations

The Nash strategies have an interesting structure here which is evident in Figures 3 and 4 as well. The strategies 
are deterministic except when both agents are in the vicinity of one another. This is expected from the structure of 
the reward function. Also, it is clear from Figures 3 and 4 that strategy components that are near to the two objects 
converge faster compared to those which are farther from the two objects. Note that strategy components of those 
positions which have no probability of being visited by an agent, with the agent being in a particular position, are 
not shown with arrow marks. For instance, in Figure 3(d), position (1,1) has no probability of being visited by the 
first agent located at (2,1). 

5 Non-Convergence to a Nash Equilibrium 

Theorem 2.4 showed that a feasible point $(v^*, \pi^*)$ corresponds to a Nash equilibrium if and only if the objective value $f(v^*, \pi^*) = 0$. However, for a gradient-based scheme, it would be apt to have conditions represented in terms of gradients of the objective and constraints. In this direction, we now present a series of results which ultimately give the desired set of necessary and sufficient conditions for a minimum point to be a global minimum. For a given point $(v, \pi)$, let $G = [\nabla g_j(v,\pi) : j = 1,2,\ldots,N]$ represent a matrix whose columns are gradients of all the constraints (8).

Proposition 5.7 At any given point $(v, \pi)$, the gradient of the objective function $f(v,\pi)$ can be expressed as a linear combination of the gradients of all the constraints (8). In other words, $\nabla f(v,\pi) = G\lambda'$, where $\lambda'$ is an appropriate vector.

Figure 3. Convergence of the strategy updates of the first agent when it is located at (2,1) and the other agent is located at (2,0).

Figure 4. Convergence of the strategy updates of the second agent when it is located at (2,0) and the other agent is located at (2,1).


Proof: Let $h_1(x,a^1) = \pi^2(x)^T \Big[ r^1(x,a^1,A^2(x)) + \beta \sum_{y \in U(x)} P(y|x,a^1,A^2(x))\, v^1(y) \Big] - v^1(x)$. Then, $h_1(x,a^1) \le 0$ represents the set of constraints (7(a)). Similarly, let $h_2(x,a^2) = \Big[ r^2(x,A^1(x),a^2) + \beta \sum_{y \in U(x)} P(y|x,A^1(x),a^2)\, v^2(y) \Big] \pi^1(x) - v^2(x)$. Thus, $h_2(x,a^2) \le 0$ represents the set of constraints (7(b)). Now, we observe that the objective of the optimization problem (7) can be re-expressed in terms of $h_i(x,a^i)$ and $\pi^i(x,a^i)$, $i = 1,2$, as follows:

$$f(v,\pi) = \sum_{i=1}^{2} \sum_{x \in S} \sum_{a^i \in A^i(x)} -\pi^i(x,a^i)\, h_i(x,a^i). \tag{22}$$

So, the objective can be visualized as the sum of products between the LHS of (7(a)) and that of the corresponding constraints in (7(e)). Note that the equality constraints are easily eliminated as expressed in (8). Thus, all the constraints of interest are inequality constraints which pair up, one from (7(a))-(7(b)) and the other from (7(e))-(7(f)). It is now easy to see the desired result by considering the chain rule of differentiation. ■

Note that the vector $\lambda'$ discussed in Proposition 5.7 is in value the same as the negative of the pair constraint. For instance, let for some $j$, $g_j(v,\pi) = h_i(x,a^i)$. Then, $\lambda'_j = -\pi^i(x,a^i)$. Similarly, for some $j$ for which $g_j(v,\pi) = \pi^i(x,a^i)$, we have $\lambda'_j = -h_i(x,a^i)$.


Let $\lambda' = \begin{bmatrix} \lambda'_I \\ \lambda'_K \end{bmatrix}$, where $\lambda'_I$ is the part of $\lambda'$ corresponding to active constraints and $\lambda'_K$ is that corresponding to inactive constraints. Similarly, let the set of Lagrange multipliers be $\lambda = \begin{bmatrix} \lambda_I \\ \lambda_K \end{bmatrix}$, where $\lambda_I$ is the part of $\lambda$ corresponding to active constraints and $\lambda_K$ is that corresponding to inactive constraints.


Lemma 5.8 Under assumption (v) of Section 3.2, if $\lambda'_K = 0$ at a KKT point $(v^*, \pi^*)$, then $\lambda_I = -\lambda'_I$.

Proof: Let $G = [G_I\ \ G_K]$ be the previously defined matrix of gradients of all constraints, where $G_I$ is the part of matrix $G$ containing gradients of all active constraints and $G_K$ that containing gradients of all inactive constraints. Now, from Proposition 5.7, we have that $\nabla f(v^*,\pi^*) = G\lambda'$. From the KKT conditions (20), we have $\nabla f(v^*,\pi^*) = -G\lambda$. Combining the two, we get $G(\lambda + \lambda') = 0$. Since $G$ is full-rank from assumption (v), we have

$$G^T G(\lambda + \lambda') = 0.$$

This can be re-written as

$$
\begin{bmatrix} G_I^T G_I & G_I^T G_K \\ G_K^T G_I & G_K^T G_K \end{bmatrix}
\begin{bmatrix} \lambda_I + \lambda'_I \\ \lambda_K + \lambda'_K \end{bmatrix} = 0.
$$

In other words, we have a set of simultaneous equations as follows:

$$G_I^T G_I(\lambda_I + \lambda'_I) + G_I^T G_K(\lambda_K + \lambda'_K) = 0, \tag{23}$$

$$G_K^T G_I(\lambda_I + \lambda'_I) + G_K^T G_K(\lambda_K + \lambda'_K) = 0. \tag{24}$$

Note that at a KKT point, it is easy to see that $\lambda_K = 0$. Also, we have $\lambda'_K = 0$. So, from (23), we have

$$G_I^T G_I(\lambda_I + \lambda'_I) = 0.$$

Since $G_I$ is of full rank, $G_I^T G_I$ is invertible. Hence, the result. ■

Corollary 5.9 Under assumption (v) of Section 3.2, if $\lambda'_K = 0$ at a KKT point $(v^*, \pi^*)$, then $\lambda = -\lambda'$.

Proof: At a KKT point, $\lambda_K = 0$. Thus, the result follows. ■

Theorem 5.10 A KKT point $(v^*, \pi^*)$ corresponds to a Nash equilibrium of the underlying general-sum stochastic game, if and only if $\lambda'_K = 0$.

Proof: We provide the proof in two parts below.

If part: From Corollary 5.9, we have $\lambda = -\lambda'$. Let us consider a pair of constraints, $h_i(x,a^i) \le 0$ and $\pi^i(x,a^i) \ge 0$, for some $x \in S$, $a^i \in A^i(x)$, $i = 1,2$. We consider the following cases, in each of which we show that $h_i(x,a^i)\,\pi^i(x,a^i) = 0$ independent of the choice of $x \in S$, $a^i \in A^i(x)$, $i = 1,2$. Thus from (22), it would follow that $f(v^*, \pi^*) = 0$. The result then follows from Theorem 2.4.

1. When $h_i(x,a^i) < 0$ and $\pi^i(x,a^i) = 0$, or $h_i(x,a^i) = 0$ and $\pi^i(x,a^i) > 0$, or $h_i(x,a^i) = 0$ and $\pi^i(x,a^i) = 0$, the result follows.

2. The case when $h_i(x,a^i) < 0$ and $\pi^i(x,a^i) > 0$ does not occur. We prove this by contradiction. Suppose this case holds. Since $h_i(x,a^i) < 0$ is an inactive constraint, by the complementary slackness KKT condition (20(b)), we have that the corresponding $\lambda_j = 0 = -\lambda'_j = \pi^i(x,a^i)$, which contradicts $\pi^i(x,a^i) > 0$. Thus, this case does not occur.

Only if part: If a KKT point $(v^*, \pi^*)$ corresponds to a Nash equilibrium, then by Theorem 2.4, we have that $f(v^*, \pi^*) = 0$. From equation (22), we have

$$\sum_{i=1}^{2} \sum_{x \in S} \sum_{a^i \in A^i(x)} -\pi^i(x,a^i)\, h_i(x,a^i) = 0.$$

Since a KKT point is always a feasible point of the optimization problem (7), every summand in this equation is non-negative. So, we have

$$\pi^i(x,a^i)\, h_i(x,a^i) = 0, \quad \forall x \in S,\ a^i \in A^i(x),\ i = 1,2. \tag{25}$$

Now from (25) and the complementary slackness KKT condition (20(b)), it is easy to see that $\lambda'_K = 0$. ■
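
The complementarity argument used in the proof of Theorem 5.10 can be checked numerically on toy data. The Python sketch below is only an illustration: the function names and the representation of the paired constraints as a list of `(pi, h)` tuples (with `pi >= 0` and `h <= 0` at a feasible point) are assumptions made for the example.

```python
def objective_from_pairs(pairs):
    """Objective value f = -sum(pi * h) over the paired constraint values."""
    return -sum(pi * h for pi, h in pairs)


def is_complementary(pairs, tol=1e-12):
    """True when pi * h vanishes (up to tol) for every pair."""
    return all(abs(pi * h) <= tol for pi, h in pairs)
```

At a feasible point every term $-\pi^i h_i$ is non-negative, so the objective is zero exactly when every pair is complementary; a single pair with $\pi^i > 0$ and $h_i < 0$ makes the objective strictly positive.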
Definition 2 (KKT-N point) A KKT point $(v^*, \pi^*)$ of the optimization problem (7) is defined to be a KKT-N (KKT-Nash) point, if the matrix

$G' = G_K^T \left( I - G_I (G_I^T G_I)^{-1} G_I^T \right) G_K$, computed at the KKT point, is full rank.

Lemma 5.11 Under assumption (v) of Section 3.2, a KKT-N point $(v^*, \pi^*)$ of the optimization problem (7) corresponds to a Nash equilibrium of the underlying general-sum discounted stochastic game.

Proof: From assumption (v), we have that $G_I$ is full-rank and hence $G_I^T G_I$ is invertible. So, (23) can be simplified to get

$$(\lambda_I + \lambda'_I) = -(G_I^T G_I)^{-1} G_I^T G_K (\lambda_K + \lambda'_K).$$

Note that $\lambda_K = 0$ by definition at a KKT point. The above can be substituted in (24) and simplified to get

$$G_K^T \left( I - G_I (G_I^T G_I)^{-1} G_I^T \right) G_K \lambda'_K = 0.$$

Since at a KKT-N point, the matrix $G_K^T \left( I - G_I (G_I^T G_I)^{-1} G_I^T \right) G_K$ is full-rank, we have $\lambda'_K = 0$. Now we have the desired result from Theorem 5.10. ■

Thus, we have a sufficient condition for a KKT point to correspond to a Nash equilibrium. Note that the 
matrix G' that needs to be of full rank is dependent on (i) the reward function and state transition probabilities 
of the underlying stochastic game, (ii) the value function and strategy-pair at the current KKT point, and (iii) the 
set of active and inactive constraints. These dependencies are highly non-linear and difficult to separate. Using 
this sufficient condition, we can obtain a weak result on the convergence of gradient-based algorithms to Nash 
equilibrium solutions, as follows. Here by gradient-based algorithms, we mean those algorithms which assure 
convergence to a KKT point of a given optimization problem. For instance, the algorithm given in Section 3 is one 
such algorithm. 

Theorem 5.12 Under assumption (v) of Section 3.2, if every KKT point is also a KKT-N point, then any gradient- 
based algorithm when applied to the optimization problem (7) would converge to a point corresponding to a Nash 
equilibrium of the underlying general-sum discounted stochastic game. 

On the contrary, if a general-sum discounted stochastic game is such that there is at least one KKT point which 
is not a KKT-N point, then convergence of plain gradient-based algorithms to Nash equilibrium is not assured. 

6 Conclusion 

We first proposed a simple gradient descent scheme for the solution of general-sum stochastic games. During the construction of the scheme, we discussed the overall nature of the indefinite objective and non-convex constraints, illustrating the fact that a simple steepest descent algorithm may not even converge to a local minimum of the optimization problem. The proposed scheme takes these issues into account while constructing both (i) the feasible search direction and (ii) the optimal step-length. Also, it tries to address numerical efficiency by appropriately using sparsity techniques for an associated matrix inversion. We observed that the size of the optimization problem increases exponentially in the number of variables and the number of constraints. We showed that the proposed scheme converges to a KKT point of the optimization problem. This was seen to be sufficient in simulations performed for the example problem of terrain exploration. However, in general, we showed in Section 5 that it may not be sufficient for a scheme to converge to any KKT point, as the same may not correspond to a Nash equilibrium. The results discussed in Section 5 can be easily generalized to the case where there are more than two players. In summary, usual gradient schemes could possibly suffer from two issues: (i) non-convergence to Nash equilibria, which is the more serious of the two issues, and (ii) scalability to higher problem sizes. It would be interesting to derive gradient-based algorithms that provide guaranteed convergence to Nash equilibria.